In the era of digital media, the spread of misinformation has become a significant concern. Fake news can have devastating consequences, influencing public opinion and undermining trust in institutions. This project tackles the critical task of developing a fake news detection model on a comprehensive dataset of labeled news articles. By applying machine learning and natural language processing techniques, I aim to identify patterns and characteristics that distinguish true from fake news and to create a reliable model that accurately classifies news articles, ultimately contributing to a more informed and trustworthy online environment.
The data cleaning procedures can be found in the cleaning.ipynb notebook. Data cleaning involved:
* Stripping the source. All true texts began with a source-indicating string of the form “LOCATION (Reuters) -”. To avoid over-fitting on this particular string, it was removed from true news texts.
* Removing extra spaces. All true texts had an extra whitespace at the end, which was removed.
* Removing items with bad dates. Some fake items had links instead of dates; these 10 items were removed.
* Removing headlines with links. Over 3,000 fake headlines contained links, compared to only 2 true headlines. To avoid over-fitting on this particular feature, headlines with links were removed.
* Removing headlines with no intelligible text. Using the langdetect module, 673 fake headlines and one true headline were detected as not English. These texts contained single words or unintelligible content and were therefore removed.
* Dropping duplicates. 225 true and 4,592 fake duplicated headlines were removed.
* Removing very short headlines. 226 fake headlines with fewer than 20 words were dropped.
Data points left after cleaning:
Show the code
len(merged_data)
35847
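The source-stripping and whitespace steps above can be sketched as follows. This is a stdlib-only illustration; the actual pattern used in cleaning.ipynb, and the “WASHINGTON” location in the example, are assumptions.

```python
import re

# True-news texts start with a source string such as "WASHINGTON (Reuters) - ".
# Strip it so a model cannot key on this artifact, and drop trailing whitespace.
SOURCE_RE = re.compile(r"^[A-Z][A-Za-z/,.\s-]*\(Reuters\)\s*-\s*")

def strip_source(text: str) -> str:
    return SOURCE_RE.sub("", text).rstrip()

strip_source("WASHINGTON (Reuters) - Zambian president urges unity ")
# -> "Zambian president urges unity"
```

A function like this would be applied to the true-news text column before merging the two classes.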
4 Exploratory Data Analysis
4.1 Data Distribution
Example Fake Items:
Show the code
disf.table_display(fake_headlines.head(2))
(table columns: title, text, subject, date, language)
WATCH: Kellyanne Conway Pathetically Begs People To Buy Ivanka Trump’s Products On Fox News
Donald Trump is clearly using the office of the presidency to enrich himself and his family, a clear violation of the Constitution.In January, the retailer Nordstrom informed Ivanka Trump of their decision to drop her product line from their stores. A boycott of Ivanka s products and any retailer who sells them contributed to the decision. That s the way the free market works.But after the decision became public earlier this week Trump lost his shit over it on Wednesday.My daughter Ivanka has been treated so unfairly by @Nordstrom. She is a great person always pushing me to do the right thing! Terrible! Donald J. Trump (@realDonaldTrump) February 8, 2017Nordstrom responded by announcing that the decision was made purely for business reasons and that Ivanka was informed last month and understood the choice they made. Since then, Nordstrom s stock has risen and other retailers have also dropped Ivanka s products.And so, Donald Trump sent Kellyanne Conway on Fox News on Thursday to give free advertising to Ivanka s business. They are using the most prominent woman in Donald Trump s life, Conway began. Using her, who has been a champion for empowerment of women in the workplace, to get to him. I think she s gone from 800 stores to a thousand stores, or a thousand places where you can buy. You can buy her goods online, she continued before literally telling Fox viewers to go buy her products. Go buy Ivanka s stuff is what I would tell you. I hate shopping, I m going to go get some myself today. This is just wonderful line. I own some of it. I fully I m going to give a free commercial here. Go buy it today, everybody. You can find it online. Here s the video via YouTube.Seriously. Kellyanne Conway actually shilled for Ivanka s products during an interview. This is not only an unconstitutional conflict of interest, it s completely unfair. 
Trump is literally using his platform to advertise his daughter s business, a business she was supposedly going to walk away from because she joined Trump s White House team. Clearly, she lied just like her daddy did.The fact that Conway appeared on Fox to advertise for the Trumps demonstrates that Donald Trump only sees the presidency as a way to enrich himself and his family. He doesn t give a damn about the economy or foreign policy or any other problems we face. He only cares about personal profit and this is proof of his selfishness.Featured Image: Yana Paskova/Getty Images
News
2017-02-09 00:00:00
en
“LIBERAL BULLY” Middle School Teacher Tells Students 13 Yr. Old Black Conservative’s “Not Worth Saving in a Fire” [VIDEO]
Outspoken conservative CJ Pearson hasn t heard from the White House and doesn t expect to receive an invitation A teacher at Columbia Middle School in Evans, Georgia allegedly told his students that outspoken 13-year-old black conservative CJ Pearson was not worth saving in a fire, and that he hated him. This is just the latest example of how liberals seem to believe that hate speech is acceptable so long as it s directed at those on the right and especially minority conservatives.Via Paul Joseph Watson at Infowars:Pearson previously made headlines after his Twitter account was blocked by President Obama s official Twitter account following a video in which Pearson criticized Obama over his response to the Clock Kid controversy. White House officials also made fun of the teenager.Pearson was told by several other students in his class that teacher Michael Garrison said CJ is not worth saving in a fire and that he hates him. The teacher also accused Pearson of cheating on a vocabulary test when he was in sixth grade, a claim that Pearson denies. It s always great having a teacher that s not only a liberal bully, but someone who engages in slander, Pearson told BizPac Review. My words are bold and I don t expect everyone to agree. But to have a teacher say this about me? Completely inexcusable. School principal Eli Putnam has promised a full investigation into the matter. Pearson accuses the teacher of violating the school s bullying policy. Via: Gateway PunditHere s conservative CJ Pearson asking Barack Obama: Does every Muslim that can build a clock gets a presidential invitation?
left-news
2015-10-18 00:00:00
en
Example True Items:
Show the code
disf.table_display(true_headlines.head(2))
(table columns: title, text, subject, date, language)
Zambian president urges unity as government, opposition prepare for talks
Zambian President Edgar Lungu on Friday called for unity among political groups ahead of talks between the government and the opposition aimed at reconciliation after a political crisis earlier this year. The leader of the opposition United Party for National Development (UPND), Hakainde Hichilema, was arrested with five others in April and charged with plotting to overthrow the government after his convoy failed to make way for Lungu s motorcade. The case stoked political tensions in Zambia, a major copper producer and seen as one of Africa s more stable and functional democracies, following a bruising election last year. Hichilema was freed from prison in August after the state dropped the charges, to pave the way for dialogue between the two sides following mediation by Commonwealth Secretary-General Patricia Scotland. Scotland s special envoy Ibrahim Gambari is in Zambia and has separately held talks with Lungu, Hichilema and other opposition leaders. In an address at the opening of the national assembly, Lungu said Zambians could disagree and quarrel but would always remain one. The factors that unite us are much greater than those that seek to divide us, he said. Opposition UPND members of parliament, who boycotted Lungu s last address, attended Friday s session, saying their attendance would give confidence to the process of dialogue. The UPND MPs took this decision in the interest of the country in view of the forthcoming political dialogue, their spokesman Jack Mwiimbu said in a statement.
worldnews
2017-09-15 00:00:00
en
Dozens of unidentified bodies found near Libyan city of Benghazi
The bodies of 37 unidentified people have been found near the eastern Libyan city of Benghazi, security sources said on Friday. The bodies were found on Thursday night in Al-Abyar, about 70 km (44 miles) east of Benghazi. The security sources gave no information about their possible identity. Smaller numbers of bodies have been found in and around Benghazi on several occasions in recent months. The area is controlled by the Libyan National Army (LNA), a force headed by eastern-based commander Khalifa Haftar. He declared victory in a campaign for Benghazi in July, though some fighting has continued in one district of the city.
Figure 9: Distributions of token counts of headline texts when tokenized with BERT’s uncased tokenizer. The red line indicates 512 tokens (BERT’s context-window limit).
Notably, fake headlines tend to be shorter. However, a significant proportion of both true and fake headlines exceed the typical context length of “classical” NLP neural network models such as BERT. This raises the concern that larger context windows may be needed to identify fake news effectively, as older models may not handle longer headlines.
A baseline model is set using heuristic rules derived from the common words with the largest difference in occurrence between true and fake news. Each word more common in fake news that appears in a given text adds a point, while each word more common in true news deducts a point. Texts with a final score greater than 0 are classified as fake news.
Baseline Predictions:
Show the code
```python
def base_line_prediction(text):
    score = 0
    for word in words_common_fake:
        if word in text:
            score += 1
    for word in words_common_true:
        if word in text:
            score -= 1
    return 1 if score > 0 else 0

baseline_predictions = val_data["text"].map_elements(base_line_prediction)
```
The code for Naive Bayes (NB) model training can be found inside naive_bayes.ipynb notebook.
The best options for text data representation were determined during the hyperparameter tuning process. The results showed that:
For headline text data, the Naive Bayes model achieved its best F1-score of 0.88 when trained on bi-grams (pairs of consecutive words).
For title data, the highest F1-score of 0.64 was obtained when training the Naive Bayes model on single words.
Stacking the Naive Bayes models with Logistic Regression did not improve the results.
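For illustration, a bi-gram multinomial Naive Bayes classifier can be written from scratch in a few lines. This is a stdlib-only sketch, not the notebook's implementation, which presumably relies on a standard library such as scikit-learn; the toy texts and labels are invented.

```python
import math
from collections import Counter

def bigrams(text):
    toks = text.lower().split()
    return list(zip(toks, toks[1:]))

class BigramNB:
    """Minimal multinomial Naive Bayes over word bi-grams with Laplace smoothing."""
    def fit(self, texts, labels):
        self.priors, self.counts, self.totals = {}, {}, {}
        self.vocab = set()
        n = len(labels)
        for c in set(labels):
            docs = [t for t, y in zip(texts, labels) if y == c]
            self.priors[c] = math.log(len(docs) / n)       # log P(class)
            cnt = Counter(bg for d in docs for bg in bigrams(d))
            self.counts[c] = cnt
            self.vocab.update(cnt)
        for c in self.counts:
            self.totals[c] = sum(self.counts[c].values())
        return self

    def predict(self, text):
        V = len(self.vocab)
        def score(c):
            # log P(class) + sum of smoothed log P(bigram | class)
            s = self.priors[c]
            for bg in bigrams(text):
                s += math.log((self.counts[c][bg] + 1) / (self.totals[c] + V))
            return s
        return max(self.priors, key=score)

# Toy usage: label 1 = fake, 0 = true (invented examples)
model = BigramNB().fit(
    ["buy this stuff now", "buy this now",
     "markets rose on friday", "markets fell on friday"],
    [1, 1, 0, 0],
)
```

The bi-gram representation captures short word sequences ("not worth", "said on") that single words miss, which is consistent with bi-grams winning the tuning for the longer headline texts.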
6.2 Tree Based Models
The code for tree-based model training can be found inside trees.ipynb.
The next iteration of the fake news classifier involves text feature engineering and tree-based models. Several new features were added to the dataset, including:
Token count: The total number of tokens in the title and headline text of each item
Capital letter count: The number of capital letters per token in the headline text and title for each item
Sentiment features: Vader sentiment intensity features for each item
Naive Bayes probabilities: The probabilities generated by the previously described Naive Bayes models
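Two of these features can be computed with plain Python, as in the minimal sketch below; the sentiment and Naive Bayes features require their respective models and are omitted here.

```python
def engineered_features(text):
    """Token count and capital letters per token for one text field.

    A stdlib-only sketch of two of the engineered features; the notebook's
    exact tokenization may differ.
    """
    tokens = text.split()
    token_count = len(tokens)
    capitals_per_token = (
        sum(1 for ch in text if ch.isupper()) / token_count if token_count else 0.0
    )
    return {"token_count": token_count, "capitals_per_token": capitals_per_token}
```

The capital-letter ratio is plausibly discriminative here because many fake titles in this dataset are written in all caps (e.g. “LIBERAL BULLY”, “WATCH:”).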
Two types of models were trained on this expanded feature set: LightGBM with different algorithms and a single decision tree. Hyperparameter tuning was performed for each model to optimize their performance.
The best results were achieved with a LightGBM model that utilized the Dropouts meet Multiple Additive Regression Trees (DART) algorithm, with a learning rate of 0.06, a maximum of 71 leaves, and 163 estimators. This model achieved an F1-score of 0.980, significantly outperforming the previous Naive Bayes models. The addition of the capital letter count and text length features was found to be particularly effective in improving the model’s accuracy.
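The reported best configuration translates to roughly the following LightGBM parameters. This is a sketch only: feature-matrix construction and any parameters not mentioned in the text are omitted, and the objective setting is an assumption.

```python
# Best hyperparameters reported for the LightGBM run (DART boosting).
best_params = {
    "boosting_type": "dart",  # Dropouts meet Multiple Additive Regression Trees
    "learning_rate": 0.06,
    "num_leaves": 71,
    "n_estimators": 163,
    "objective": "binary",    # assumed: binary fake/true classification
}

# Usage (assuming lightgbm is installed and X_train / y_train exist):
# import lightgbm as lgb
# model = lgb.LGBMClassifier(**best_params).fit(X_train, y_train)
```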
6.3 Neural Network Models
The code for neural-network model training can be found inside nn.ipynb notebook. Several experiments using neural network models were conducted.
Model Training Description
Data Loader. The data loader divides the data into training, validation, and test sets based on dates as described above.
Loss Function. Binary Cross entropy was used.
Fine-tuning is initiated based on a loss delta. The backbone layers are unfrozen with a lower learning rate.
Early Stopping. Model training halts using a loss delta metric. The model parameters from the best epoch are restored.
Logging. Experiments are logged using MLFlow. Parameters, metrics, and models from each experiment are uploaded to Databricks (Community Edition).
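The loss-delta early stopping described above can be sketched in a framework-agnostic way; the patience and delta values here are assumptions, and checkpoint saving/restoring is left to the training loop.

```python
class EarlyStopper:
    """Halt training when validation loss fails to improve by `min_delta`
    for `patience` consecutive epochs, remembering the best epoch so its
    checkpointed model parameters can be restored afterwards."""

    def __init__(self, patience=3, min_delta=1e-3):
        self.patience, self.min_delta = patience, min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True to halt training."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0  # a checkpoint from this epoch would be kept
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The same loss-delta signal can also drive the fine-tuning trigger: once improvement stalls, unfreeze the backbone layers and continue with a lower learning rate.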
Although the BERT model’s complexity was significantly higher, it outperformed the Naive Bayes and engineered-feature tree models by only 0.06 F1-score points, so its effectiveness must be weighed against the added complexity.
6.5 Adversarial Testing
To evaluate how the models would perform if the global news context were to change, the names of the two most commonly mentioned politicians were altered in the validation data.
Both models lost only 0.001 F1-score when the names were changed. The BERT model will be used going forward, as it slightly outperforms the other models.
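The name-swap perturbation amounts to a simple substitution over the validation texts. The specific names and replacements below are hypothetical; the notebook's actual choices may differ.

```python
# Hypothetical name swaps for adversarial testing (illustrative only).
swaps = {"Trump": "Smith", "Clinton": "Jones"}

def perturb(text):
    """Replace the most commonly mentioned politicians' names in a text."""
    for old, new in swaps.items():
        text = text.replace(old, new)
    return text
```

Re-scoring the models on the perturbed validation set then shows how much they rely on these specific names rather than on more general signals.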
Figure 10: The precision, recall and F1 scores of the BERT model’s predictions with different decision thresholds.
The model maintains extremely high recall across the decision-threshold range, which means the threshold can be raised to as much as 0.95 for increased precision and F1-score.
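The threshold sweep behind Figure 10 amounts to recomputing precision, recall, and F1 at each cutoff, as in this minimal sketch (label 1 is assumed to denote fake news):

```python
def prf1(y_true, probs, threshold):
    """Precision, recall, and F1 for predictions at a given decision threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Evaluating this over a grid of thresholds produces the curves in the figure and identifies the 0.95 cutoff as a viable operating point.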
Figure 15: LIME explanation of an erroneous fake news prediction.
Overall, the model’s predictive performance is impressive, despite the difficulty in identifying the specific features that contribute to its decisions. This complexity may actually be a beneficial trait in the context of fake news detection, where adversaries are constantly attempting to evade detection. By relying on features that are not immediately apparent, the model may be more resilient to attempts to manipulate or evade its detection, making it a more effective tool in the fight against disinformation.
8 Further Considerations and Potential Improvements
The difficult nature of fake news detection, coupled with the relentless efforts of troll farms and the dynamic global news landscape, underscores the need for continuous model updates. In light of these challenges, it is advisable to deploy a suite of models leveraging diverse features and update them regularly to stay ahead of the curve.
For the model presented in this project, several areas of immediate improvement are identified:
Incorporating language models with larger context windows could enhance prediction reliability, as many texts exceed the 512-token limit. The Longformer, with an additional 100 tokens of context, showed promising results.
Developing additional models with comparable performance, but based on distinct text features, would create a robust news filtering system.
More comprehensive adversarial testing of the models is recommended to simulate real-world scenarios.
A more diverse dataset should be employed in the future to mitigate over-fitting on specific writing styles, as most true news samples originated from a single source.
9 Conclusions
The dataset consisted of 45K news items, with 9K removed due to errors or unintelligibility. The cleaned dataset contained 15K true news items and 20K fake news items.
The distribution of true and fake news was uneven over time, with a time split used to create a balanced training set and more true news in the validation and test sets.
Analysis of the text revealed that common words in both fake and true news included people and locations, but fake news had more names, adverbs, and prepositions, while true news had more verbs.
The headline text was up to 2000 tokens long, with the upper quartile being 645 for true and 625 for fake news.
A baseline F1-score of 0.46 was achieved using the heuristic approach, while a lightweight LightGBM model using Naive Bayes probabilities achieved an F1-score of 0.980.
The best-performing model was a BERT-based model with a custom classifier head, achieving an F1-score of 0.993 on the test set.
The model exhibited excellent recall and could tolerate a decision threshold of 0.95 with minimal loss of recall.
Adversarial testing showed that the model’s performance was marginally affected by changes in the global news context.
The model’s complex text features made it difficult to distinguish for humans, which could make it harder for adversaries to break.